{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Author:__ Bram Van de Sande\n",
    "    \n",
    "__Date:__ 14 JUN 2019\n",
    "\n",
    "__Outline:__ Notebook generating list of Transcription Factors (TFs) for human and mouse. These lists can be used for the network inference step of SCENIC (step 1 - GENIE3/GRNBoost2)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__DATA ACQUISITION:__\n",
    "1. Download motif annotations for _H. sapiens_ - HGNC symbols: `wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.hgnc-m0.001-o0.0.tbl`\n",
    "2. Download motif annotations for _M. musculus_ - MGI symbols: `wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.mgi-m0.001-o0.0.tbl`\n",
    "3. Download list of curated human transcription factors from: Lambert SA et al. The Human Transcription Factors. Cell 2018 https://dx.doi.org/10.1016/j.cell.2018.01.029 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "BASEFOLDER_NAME = '../resources/'\n",
    "\n",
    "MOTIFS_HGNC_FNAME = os.path.join(BASEFOLDER_NAME, 'motifs-v9-nr.hgnc-m0.001-o0.0.tbl')\n",
    "MOTIFS_MGI_FNAME = os.path.join(BASEFOLDER_NAME, 'motifs-v9-nr.mgi-m0.001-o0.0.tbl')\n",
    "CURATED_TFS_HGNC_FNAME = os.path.join(BASEFOLDER_NAME, 'lambert2018.txt')\n",
    "\n",
    "OUT_TFS_HGNC_FNAME = os.path.join(BASEFOLDER_NAME, 'hs_hgnc_tfs.txt')\n",
    "OUT_TFS_HGNC_FNAME = os.path.join(BASEFOLDER_NAME, 'hs_hgnc_curated_tfs.txt')\n",
    "OUT_TFS_MGI_FNAME = os.path.join(BASEFOLDER_NAME, 'mm_mgi_tfs.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__MUS MUSCULUS__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1721"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_motifs_mgi = pd.read_csv(MOTIFS_MGI_FNAME, sep='\\t')\n",
    "mm_tfs = df_motifs_mgi.gene_name.unique()\n",
    "with open(OUT_TFS_MGI_FNAME, 'wt') as f:\n",
    "    f.write('\\n'.join(mm_tfs) + '\\n')\n",
    "len(mm_tfs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__HOMO SAPIENS__\n",
    "\n",
    "List of TFs based on motif collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1839"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_motifs_hgnc = pd.read_csv(MOTIFS_HGNC_FNAME, sep='\\t')\n",
    "hs_tfs = df_motifs_hgnc.gene_name.unique()\n",
    "with open(OUT_TFS_HGNC_FNAME, 'wt') as f:\n",
    "    f.write('\\n'.join(hs_tfs) + '\\n')\n",
    "len(hs_tfs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "List of TFs based on Lambert SA et al."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1639"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open(CURATED_TFS_HGNC_FNAME, 'rt') as f:\n",
    "    hs_curated_tfs = list(map(lambda s: s.strip(), f.readlines()))\n",
    "len(hs_curated_tfs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "List of human curated TFs for which a motif can be assigned based on our current version of the motif collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1390"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hs_curated_tfs_with_motif = list(set(hs_tfs).intersection(hs_curated_tfs))\n",
    "len(hs_curated_tfs_with_motif)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(OUT_TFS_HGNC_FNAME, 'wt') as f:\n",
    "    f.write('\\n'.join(hs_curated_tfs_with_motif) + '\\n')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6 (pyscenic_dev)",
   "language": "python",
   "name": "pyscenic_dev"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
